(torchx/local_scheduler) Use os.kill instead of os.killpg when sending SIGTERM to the replica pid. Add runner.wait() for torchx.runner.test.api_test#test_empty_session_id to gracefully wait for the replicas to finish running #1063

kiukchung · 2025-05-05T20:05:14Z

Summary:
Fixes macOS unit test failures: https://github.com/pytorch/torchx/actions/runs/14844253481/job/41674243877

When looking into the test failure I noticed two things:

1. `local_scheduler` was trying to SIGTERM the process group by passing the replica's pid: `os.killpg(replica.pid, signal.SIGTERM)`. Changed it to call `os.kill` instead. (Note that `os.killpg` is not available on iOS, which is why the test was failing.)
2. The `torchx.runner.test.api_test.test_empty_session_id()` test case doesn't wait for the `echo` test command to finish, so there was a race condition: in certain cases the runner's `__exit__()` SIGTERMs the replica pids, and because `local_scheduler` was (wrongfully) using `os.killpg` rather than `os.kill`, it threw an uncaught error on iOS.
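The two points above can be illustrated with a minimal, standalone sketch. It is not the torchx code: `terminate_replica`, `run_and_terminate`, and `run_and_wait` are hypothetical names, and a plain `subprocess.Popen` child stands in for a replica. It assumes a POSIX system. The first helper signals exactly one pid with `os.kill`; the second shows that waiting for the child before teardown leaves nothing to signal.

```python
import os
import signal
import subprocess
import sys


def terminate_replica(proc: subprocess.Popen) -> None:
    """Send SIGTERM to a single child process (hypothetical helper)."""
    if proc.poll() is not None:
        return  # child already exited and was reaped; nothing to signal
    try:
        # os.kill targets exactly one pid. os.killpg(pid, sig) would instead
        # interpret the number as a process-group id.
        os.kill(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # defensive: the pid no longer exists


def run_and_terminate() -> int:
    # Long-running stand-in for a replica that is still alive at teardown.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    terminate_replica(proc)
    return proc.wait(timeout=30)


def run_and_wait() -> int:
    # Short-lived stand-in for the echo command; wait before tearing down.
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('hello')"],
        stdout=subprocess.DEVNULL,
    )
    rc = proc.wait(timeout=30)
    terminate_replica(proc)  # no-op: the child has already finished
    return rc
```

On POSIX, a child killed by a signal reports a negative return code, so `run_and_terminate()` is expected to return `-signal.SIGTERM`, while `run_and_wait()` returns the child's normal exit status.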

Differential Revision: D74197282


facebook-github-bot · 2025-05-05T20:05:32Z

This pull request was exported from Phabricator. Differential Revision: D74197282

kiukchung · 2025-05-05T20:13:43Z

duplicate of #1062. Closing.

facebook-github-bot added the CLA Signed label May 5, 2025

facebook-github-bot added the fb-exported label May 5, 2025

kiukchung closed this May 5, 2025
